Structured Learning of Tree Potentials in CRF for Image Segmentation
We propose a new approach to image segmentation, which exploits the
advantages of both conditional random fields (CRFs) and decision trees. In the
literature, the potential functions of CRFs are mostly defined as a linear
combination of some pre-defined parametric models, and then methods like
structured support vector machines (SSVMs) are applied to learn those linear
coefficients. We instead formulate the unary and pairwise potentials as
nonparametric forests---ensembles of decision trees, and learn the ensemble
parameters and the trees in a unified optimization problem within the
large-margin framework. In this fashion, we easily achieve nonlinear learning
of potential functions on both unary and pairwise terms in CRFs. Moreover, we
learn class-wise decision trees for each object that appears in the image. Due
to the rich structure and flexibility of decision trees, our approach is
powerful in modelling complex data likelihoods and label relationships. The
resulting optimization problem is very challenging because it can have
exponentially many variables and constraints. We show that this challenging
optimization can be efficiently solved by combining a modified column
generation and cutting-planes techniques. Experimental results on both binary
(Graz-02, Weizmann horse, Oxford flower) and multi-class (MSRC-21, PASCAL VOC
2012) segmentation datasets demonstrate the power of the learned nonlinear
nonparametric potentials.
Comment: 10 pages. Appearing in IEEE Transactions on Neural Networks and Learning Systems
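As a rough sketch of the kind of energy this approach learns (the notation here is illustrative, not the paper's exact formulation), the CRF score uses unary and pairwise potentials that are weighted sums of decision-tree outputs:

```latex
E(\mathbf{y}, \mathbf{x}; \mathbf{w}) =
\sum_{p \in \mathcal{V}} \sum_{t=1}^{T_u} w^{u}_{t}\, f_{t}(\mathbf{x}_p, y_p)
\;+\;
\sum_{(p,q) \in \mathcal{E}} \sum_{t=1}^{T_e} w^{e}_{t}\, g_{t}(\mathbf{x}_{pq}, y_p, y_q)
```

Here the $f_t$ and $g_t$ are decision trees, and both the weights $\mathbf{w}$ and the trees themselves are learned under large-margin constraints; column generation grows the ensembles one tree at a time, while cutting planes handles the exponentially many labeling constraints mentioned above.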
Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval
Image retrieval aims to find images in a database that are visually similar to a query image. Two-stage methods following the retrieve-and-rerank paradigm have achieved excellent performance, but their separate local and global modules are inefficient for real-world applications. To better trade off retrieval efficiency and accuracy, some approaches fuse global and local features into a joint representation to perform single-stage image retrieval. However, they still struggle in various challenging situations, e.g., background clutter, occlusion, and viewpoint changes. In this work, we design a Coarse-to-Fine framework to learn a Compact Discriminative representation (CFCD) for end-to-end single-stage image retrieval, requiring only image-level labels.
Specifically, we first design a novel adaptive softmax-based loss that dynamically tunes its scale and margin within each mini-batch and increases them progressively to strengthen supervision during training and improve intra-class compactness. Furthermore, we propose a mechanism that attentively selects prominent local descriptors and infuses fine-grained semantic relations into the global representation via a hard negative sampling strategy, optimizing inter-class distinctiveness at a global scale. Extensive experimental results
have demonstrated the effectiveness of our method, which achieves
state-of-the-art single-stage image retrieval performance on benchmarks such as
Revisited Oxford and Revisited Paris. Code is available at
https://github.com/bassyess/CFCD.
Comment: Accepted to ICCV 2023
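The adaptive softmax-based loss above progressively enlarges its scale and margin as training proceeds; a minimal PyTorch-style sketch of that general idea (the class name, parameter ranges, and linear schedule are assumptions, not the actual CFCD loss) might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveMarginSoftmax(nn.Module):
    """Cosine-softmax whose scale s and margin m grow with training progress."""
    def __init__(self, feat_dim, num_classes, s_range=(16.0, 32.0), m_range=(0.1, 0.3)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s_range, self.m_range = s_range, m_range

    def forward(self, feats, labels, progress):
        # progress in [0, 1]: fraction of training completed
        s = self.s_range[0] + progress * (self.s_range[1] - self.s_range[0])
        m = self.m_range[0] + progress * (self.m_range[1] - self.m_range[0])
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))  # (B, C) cosine logits
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = s * (cos - m * onehot)  # subtract the margin from the target-class logit only
        return F.cross_entropy(logits, labels)
```

Ramping the scale and margin is one way to strengthen supervision gradually and tighten intra-class compactness as the embedding stabilizes.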
Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies
Movie highlights stand out from the screenplay, enable efficient browsing, and play a crucial role on social media platforms. Building on existing efforts, this work makes two observations: (1) Highlight labels vary across annotators, which makes annotation inaccurate and time-consuming. (2) Beyond the previous supervised or unsupervised settings, existing video corpora such as trailers can be useful, but they are often noisy and do not cover the full set of highlights. In this work, we study a more practical and
promising setting, i.e., reformulating highlight detection as "learning with
noisy labels". This setting does not require time-consuming manual annotations
and can fully utilize existing abundant video corpora. First, based on movie
trailers, we leverage scene segmentation to obtain complete shots, which are
regarded as noisy labels. Then, we propose a Collaborative noisy Label Cleaner
(CLC) framework to learn from noisy highlight moments. CLC consists of two
modules: augmented cross-propagation (ACP) and multi-modality cleaning (MMC).
The former aims to exploit the closely related audio-visual signals and fuse
them to learn unified multi-modal representations. The latter aims to achieve
cleaner highlight labels by observing the changes in losses among different
modalities. To verify the effectiveness of CLC, we further collect a
large-scale highlight dataset named MovieLights. Comprehensive experiments on
MovieLights and YouTube Highlights datasets demonstrate the effectiveness of
our approach. Code has been made available at:
https://github.com/TencentYoutuResearch/HighlightDetection-CLC
Comment: Accepted to CVPR 2023
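A minimal sketch of the loss-based cleaning idea behind MMC (the selection rule and names are assumptions, not the actual module): keep candidate highlight moments whose losses are small in both modalities.

```python
import torch

def select_clean_moments(loss_audio, loss_visual, keep_ratio=0.7):
    """Keep moments whose audio and visual losses are both among the smallest."""
    k = max(1, int(keep_ratio * loss_audio.numel()))
    low_a = torch.topk(loss_audio, k, largest=False).indices
    low_v = torch.topk(loss_visual, k, largest=False).indices
    mask = torch.zeros_like(loss_audio, dtype=torch.bool)
    mask[low_a[torch.isin(low_a, low_v)]] = True  # both modalities agree the label looks clean
    return mask
```

Moments that look easy to only one modality are treated as potentially noisy trailer labels.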
Scene Consistency Representation Learning for Video Scene Segmentation
A long-term video, such as a movie or TV show, is composed of various scenes,
each of which represents a series of shots sharing the same semantic story.
Spotting the correct scene boundary from the long-term video is a challenging
task, since a model must understand the storyline of the video to figure out
where a scene starts and ends. To this end, we propose an effective
Self-Supervised Learning (SSL) framework to learn better shot representations
from unlabeled long-term videos. More specifically, we present an SSL scheme to
achieve scene consistency, while exploring a broad set of data augmentation and shuffling methods to boost model generalizability. Instead of explicitly
learning the scene boundary features as in the previous methods, we introduce a
vanilla temporal model with less inductive bias to verify the quality of the
shot features. Our method achieves state-of-the-art performance on the task of Video Scene Segmentation. Additionally, we suggest a fairer and more reasonable benchmark for evaluating the performance of Video Scene Segmentation
methods. The code is made available.
Comment: Accepted to CVPR 2022
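One way to make "scene consistency" concrete is a contrastive objective in which two augmented or shuffled views of shots from the same scene form positive pairs; the sketch below is a generic InfoNCE formulation under that assumption, not the paper's exact SSL scheme.

```python
import torch
import torch.nn.functional as F

def scene_consistency_infonce(view_a, view_b, temperature=0.1):
    """InfoNCE loss: row i of view_a and row i of view_b are shot embeddings
    from the same scene (positives); all other rows act as negatives."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```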
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
Temporal sentence grounding (TSG) aims to locate a specific moment from an
untrimmed video with a given natural language query. Weakly supervised methods still show a large performance gap compared to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost while keeping performance on the TSG task competitive with fully supervised methods. To achieve this
goal, we investigate a recently proposed glance-supervised temporal sentence
grounding task, which requires only single frame annotation (referred to as
glance annotation) for each query. Under this setup, we propose a Dynamic
Gaussian prior based Grounding framework with Glance annotation (D3G), which
consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and
a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples
reliable positive moments from a 2D temporal map via jointly leveraging
Gaussian prior and semantic consistency, which contributes to aligning the
positive sentence-moment pairs in the joint embedding space. Moreover, to
alleviate the annotation bias resulting from glance annotation and model
complex queries consisting of multiple events, we propose the DGA module, which
adjusts the distribution dynamically to approximate the ground truth of target
moments. Extensive experiments on three challenging benchmarks verify the
effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly
supervised methods by a large margin and narrows the performance gap compared
to fully supervised methods. Code is available at
https://github.com/solicucu/D3G.
Comment: Accepted to ICCV 2023
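To illustrate how a Gaussian prior centred on the glance annotation can weight candidate moments (a sketch only; the variable names and the fixed sigma are assumptions, and D3G adjusts this distribution dynamically via DGA):

```python
import torch

def gaussian_moment_prior(glance_t, moment_centers, sigma=0.1):
    """Weight candidate moments (e.g., cells of a 2D temporal map, represented by
    their normalized center times) by a Gaussian centred on the glance timestamp."""
    weights = torch.exp(-((moment_centers - glance_t) ** 2) / (2 * sigma ** 2))
    return weights / weights.sum()
```

Moments with higher prior weight are then more likely to be sampled as positives for aligning sentence-moment pairs.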
Unified and Dynamic Graph for Temporal Character Grouping in Long Videos
Video temporal character grouping locates appearing moments of major
characters within a video according to their identities. To this end, recent
works have evolved from unsupervised clustering to graph-based supervised
clustering. However, graph-based methods are built on the premise of fixed affinity graphs, which introduce many inexact connections. Besides, they extract multi-modal features with several separate models, which is unfriendly to deployment. In this
paper, we present a unified and dynamic graph (UniDG) framework for temporal
character grouping. This is accomplished, first, by a unified representation network that learns representations of multiple modalities within the same space while still preserving each modality's uniqueness. Second, we present a dynamic graph clustering scheme in which a varying number of neighbors is dynamically constructed for each node via a cyclic matching strategy, leading to a more reliable affinity graph. Third, a progressive
association method is introduced to exploit spatial and temporal contexts among
different modalities, allowing multi-modal clustering results to be well fused.
As current datasets only provide pre-extracted features, we evaluate our UniDG
method on a collected dataset named MTCG, which contains, for each character, face and body appearance clips and speaking voice tracks. We also evaluate
our key components on existing clustering and retrieval datasets to verify the
generalization ability. Experimental results show that our method achieves promising results and outperforms several state-of-the-art approaches.
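A rough sketch of building per-node neighbourhoods of varying size is reciprocal top-k matching, in which an edge is kept only when two nodes appear in each other's candidate lists; UniDG's cyclic matching strategy may differ in detail, and the names below are assumptions.

```python
import numpy as np

def reciprocal_neighbors(affinity, k=10):
    """Keep edge (i, j) only if i and j are in each other's top-k affinity lists,
    so different nodes end up with different numbers of neighbours."""
    topk = np.argsort(-affinity, axis=1)[:, :k]
    candidate = [set(row.tolist()) for row in topk]
    edges = [(i, j) for i in range(affinity.shape[0])
             for j in candidate[i] if i != j and i in candidate[j]]
    return edges
```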
Mid-level representations for action recognition and zero-shot learning
Compared with low-level features, mid-level representations of visual objects contain more discriminative and interpretable information, and they are beneficial for improving classification performance and for sharing learned information across object categories. These benefits have drawn tremendous attention from the computer vision community, and many breakthroughs have been made on various computer vision tasks using mid-level representations. In this thesis, we focus on the following problems
regarding mid-level representations: 1) How to extract discriminative mid-level
representations from local features? 2) How to suppress noisy components from
mid-level representations? 3) And how to address the issue of visual-semantic discrepancy
in mid-level representations? We deal with the first problem in the task of
action recognition and the other two problems in the task of zero-shot learning.
For the first problem, we devise a representation suitable for characterising human
actions on the basis of a sequence of pose estimates generated by an RGB-D
sensor. We show that discriminative sequences of poses typically occur over a short time window, and thus we propose a simple-but-effective local descriptor called a trajectorylet to capture the static and kinematic information within this interval. We also show that state-of-the-art recognition results can be achieved by encoding each trajectorylet using a discriminative trajectorylet detector set, selected from a large number of candidate detectors trained through exemplar-SVMs. The mid-level
representation is obtained by pooling trajectorylet encodings.
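For illustration, a trajectorylet-style descriptor could be sketched as a concatenation of static pose and frame-to-frame kinematics over a short window; the exact descriptor in the thesis differs in its details, and the array layout below is an assumption.

```python
import numpy as np

def trajectorylet_descriptor(poses):
    """poses: (T, J, 3) array of J joint positions over T consecutive frames.
    Combines a static summary with first- and second-order motion cues."""
    static = poses.mean(axis=0)                # average joint positions in the window
    velocity = np.diff(poses, axis=0)          # frame-to-frame displacements
    acceleration = np.diff(velocity, axis=0)   # change in displacement
    return np.concatenate([static.ravel(), velocity.ravel(), acceleration.ravel()])
```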
For the second problem, we address the popular research topic of zero-shot learning and focus on classifying a visual concept merely from its associated online textual source, such as a Wikipedia article. We further consider one important factor: the textual representation, as a mid-level representation, is usually too noisy for zero-shot learning tasks. We design a simple yet effective zero-shot learning method that is capable of suppressing noise in the text. Specifically, we propose an ℓ2,1-norm
based objective function which can simultaneously suppress the noisy signal in the
text and learn a function to match the text document and visual features. We also
develop an optimization algorithm to efficiently solve the resulting problem.
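A generic form of such an objective (notation illustrative, not the thesis's exact formulation) pairs a matching term with a group-sparse penalty over the text dimensions:

```latex
\min_{W}\; \sum_{i} \big\| x_i - W\, t_i \big\|_2^2
\;+\; \lambda \sum_{j} \big\| w_{\cdot j} \big\|_2
```

Here $t_i$ is the textual feature of class $i$, $x_i$ the corresponding visual feature or class prototype, and the column-wise group penalty (an $\ell_{2,1}$-type norm on $W^{\top}$) drives the columns associated with noisy text dimensions to zero while the same $W$ matches text documents to visual features.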
For the third problem, we observe that distributed word embeddings, which have become a popular mid-level representation for zero-shot learning due to their easy accessibility, are designed to reflect semantic similarity rather than visual similarity, and thus using them in zero-shot learning often leads to inferior performance. To overcome this visual-semantic discrepancy, we re-align the distributed word embedding with visual information by learning a neural network that maps it into a new representation called the visually aligned word embedding (VAWE). We further design an objective function that encourages the neighbourhood structure of VAWEs to mirror that of the visual domain. This strategy gives more freedom in learning the mapping function and allows the learned mapping to generalize to different zero-shot learning methods and different visual features.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
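As a closing illustration of the neighbourhood-mirroring idea behind VAWE described above, a triplet-style objective can pull a class's mapped word embedding towards those of visually similar classes and away from visually dissimilar ones; the network size, names, and loss form below are assumptions rather than the thesis's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small network that maps a word embedding to a visually aligned one (VAWE-style).
mapper = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 300))

def vawe_triplet_loss(word_emb, vis_pos_idx, vis_neg_idx, margin=0.2):
    """Each class's mapped embedding should be closer (in cosine distance) to that of
    a visually similar class than to that of a visually dissimilar class."""
    v = F.normalize(mapper(word_emb), dim=1)
    d_pos = 1 - (v * v[vis_pos_idx]).sum(dim=1)
    d_neg = 1 - (v * v[vis_neg_idx]).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```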